Voted Kernel Regularization


CORINNA CORTES, PRASOON GOYAL, VITALY KUZNETSOV, AND MEHRYAR MOHRI 


Abstract. This paper presents an algorithm, Voted Kernel Regularization, that provides the
flexibility of using potentially very complex kernel functions such as predictors based on much
higher-degree polynomial kernels, while benefitting from strong learning guarantees. The success
of our algorithm arises from derived bounds that suggest a new regularization penalty in terms of the
Rademacher complexities of the corresponding families of kernel maps. In a series of experiments
we demonstrate the improved performance of our algorithm as compared to baselines. Furthermore,
the algorithm enjoys several favorable properties. The optimization problem is convex, it allows for
learning with non-PDS kernels, and the solutions are highly sparse, resulting in improved
classification speed and memory requirements.


1. Introduction 

The hypothesis returned by learning algorithms such as SVMs [Cortes and Vapnik, 1995] and
other algorithms for which the representer theorem holds is a linear combination of functions
$K(x, \cdot)$, where $K$ is the kernel function used and $x$ is a training sample. The generalization
guarantees for SVMs depend on the sample size and the margin, but also on the complexity of the
kernel function $K$ used, measured by its trace [Koltchinskii and Panchenko, 2002].

These guarantees suggest that, for a moderate margin, learning with very complex kernels, such
as sums of polynomial kernels of degree up to some large $d$, may lead to overfitting, which is
frequently observed empirically. Thus, in practice, simpler kernels are typically used, that is, small
values of $d$ for sums of polynomial kernels. On the other hand, to achieve a sufficiently high
performance in challenging learning tasks, it may be necessary to augment a linear combination of such
functions $K(x, \cdot)$ with a function $K'(x, \cdot)$, where $K'$ is possibly a substantially more complex
kernel, such as a polynomial kernel of degree $d' \gg d$. This flexibility is not available when using SVMs
or other learning algorithms such as kernel Perceptron [Aizerman et al., 1964, Rosenblatt, 1958]
with the same solution form: either a complex kernel function $K'$ is used and then there is a risk
of overfitting, or a potentially too simple kernel $K$ is used, limiting the performance that could be
achieved in some tasks.

This paper presents an algorithm, Voted Kernel Regularization, that precisely provides the
flexibility of using potentially very complex kernel functions such as predictors based on much
higher-degree polynomial kernels, while benefitting from strong learning guarantees. In a series of
experiments we demonstrate the improved performance of our algorithm.

We present data-dependent learning bounds for this algorithm that are expressed in terms of the
Rademacher complexities of the reproducing kernel Hilbert spaces (RKHS) of the kernel functions
used. These results are based on the framework of Voted Risk Minimization originally introduced
by Cortes et al. [2014] for ensemble methods. We further extend these results using local
Rademacher complexity analysis to show that faster convergence rates are possible when the
spectrum of the kernel matrix is controlled. The success of our algorithm arises from these bounds,
which suggest a new regularization penalty in terms of the Rademacher complexities of the corresponding
families of kernel maps. Therefore, it becomes crucial to have a good estimate of these complexity
measures. We provide a thorough theoretical analysis of these complexities for several commonly
used kernel classes.

Besides the improved performance and the theoretical guarantees, Voted Kernel Regularization
admits a number of additional favorable properties. Our formulation leads to a convex
optimization problem that can be solved either via Linear Programming or using Coordinate Descent.
Voted Kernel Regularization does not require the kernel functions to be positive-definite or even
symmetric. This enables the use of much richer families of kernel functions. In particular, some
standard distances known not to be PSD, such as the edit-distance and many others, can be used
with this algorithm.

Yet another advantage of our algorithm is that it produces highly sparse solutions, providing
greater efficiency and lower memory needs. In that respect, Voted Kernel Regularization is similar
to the so-called norm-1 SVM [Vapnik, 1998, Zhu et al., 2003] and Any-Norm-SVM [Dekel and Singer,
2007], which all use a norm penalty to reduce the number of support vectors. However, to the best
of our knowledge, these regularization terms on their own have not led to performance improvements
over regular SVMs [Zhu et al., 2003, Dekel and Singer, 2007]. In contrast, our experimental results
show that the Voted Kernel Regularization algorithm can outperform both regular SVM and norm-1
SVM, and at the same time significantly reduce the number of support vectors. In other work,
hybrid regularization schemes are combined to obtain a performance improvement [Zou, 2007].
Possibly this technique could be applied to our Voted Kernel Regularization algorithm as well,
resulting in additional performance improvements.

Somewhat related algorithms are learning kernels or multiple kernel learning, which have been
extensively investigated over the last decade in both algorithmic and theoretical studies [Lanckriet et al.,
2004, Argyriou et al., 2005, 2006, Srebro and Ben-David, 2006, Lewis et al., 2006, Zien and Ong,
2007, Micchelli and Pontil, 2005, Jebara, 2004, Bach, 2008, Ong et al., 2005, Ying and Campbell,
2009, Cortes et al., 2010]. In learning kernels, training data is used to select a single kernel out
of the family of convex combinations of $p$ base kernels and to learn a predictor based on just one
kernel. In contrast, in Voted Kernel Regularization, every training point can be thought of as
representing a different kernel. Another related approach is Ensemble SVM [Cortes et al., 2011], where a
predictor for each base kernel is used and these predictors are combined to define a single predictor, these two
tasks being performed in a single stage or in two subsequent stages. The algorithm where the task
is performed in a single stage bears the most resemblance to our Voted Kernel Regularization.
However, the regularization is different and, most importantly, not capacity-dependent.

The rest of the paper is organized as follows. Some preliminary definitions and notation are
introduced in Section 2. The Voted Kernel Regularization algorithm is presented in Section 3 and
in Section 4 we provide strong data-dependent learning guarantees for this algorithm showing that
it is possible to learn with highly complex kernel classes and yet not overfit. In Section 4, we also
prove local complexity bounds that detail how faster convergence rates are possible provided that
the spectrum of the kernel matrix is controlled. Section 5 discusses the implementation of the Voted
Kernel Regularization algorithm including optimization procedures and analysis of Rademacher
complexities. We conclude with experimental results in Section 6.


2. Preliminaries 


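Let $\mathcal{X}$ denote the input space. We consider the familiar supervised learning scenario. We assume
that training and test points are drawn i.i.d. according to some distribution $\mathcal{D}$ over
$\mathcal{X} \times \{-1, +1\}$ and denote by $S = ((x_1, y_1), \ldots, (x_m, y_m))$ a training sample of size $m$
drawn according to $\mathcal{D}^m$.

Let $\rho > 0$. For a function $f$ taking values in $\mathbb{R}$, we denote by $R(f)$ its binary classification error,
by $\widehat{R}_S(f)$ its empirical error, and by $\widehat{R}_{S,\rho}(f)$ its empirical margin error for the sample $S$:

$$R(f) = \mathop{\mathbb{E}}_{(x,y) \sim \mathcal{D}}\big[1_{y f(x) \le 0}\big], \qquad
\widehat{R}_S(f) = \mathop{\mathbb{E}}_{(x,y) \sim S}\big[1_{y f(x) \le 0}\big], \qquad \text{and} \qquad
\widehat{R}_{S,\rho}(f) = \mathop{\mathbb{E}}_{(x,y) \sim S}\big[1_{y f(x) \le \rho}\big],$$

where the notation $(x, y) \sim S$ indicates that $(x, y)$ is drawn according to the empirical distribution
defined by $S$. We will denote by $\widehat{\mathfrak{R}}_S(H)$ the empirical Rademacher complexity on the sample $S$
of a hypothesis set $H$ of functions mapping $\mathcal{X}$ to $\mathbb{R}$, and by $\mathfrak{R}_m(H)$ its Rademacher complexity
[Koltchinskii and Panchenko, 2002, Bartlett and Mendelson, 2002]:

$$\widehat{\mathfrak{R}}_S(H) = \frac{1}{m} \mathop{\mathbb{E}}_{\sigma}\Big[\sup_{h \in H} \sum_{i=1}^m \sigma_i h(x_i)\Big], \qquad
\mathfrak{R}_m(H) = \mathop{\mathbb{E}}_{S \sim \mathcal{D}^m}\big[\widehat{\mathfrak{R}}_S(H)\big],$$

where the random variables $\sigma_i$ are independent and uniformly distributed over $\{-1, +1\}$.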
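As an illustration of the definition of the empirical Rademacher complexity, the following minimal sketch estimates it by Monte Carlo sampling of the variables $\sigma_i$ for a finite hypothesis set. The function name `empirical_rademacher`, the finite-set restriction and the number of draws are assumptions made for this example only; they are not part of the algorithm described in the paper.

```python
import random


def empirical_rademacher(hypothesis_values, num_draws=2000, seed=0):
    """Monte Carlo estimate of (1/m) E_sigma[max_h sum_i sigma_i h(x_i)].

    hypothesis_values holds one row per hypothesis h, each row being
    (h(x_1), ..., h(x_m)) evaluated on the fixed sample S.
    """
    rng = random.Random(seed)
    m = len(hypothesis_values[0])
    total = 0.0
    for _ in range(num_draws):
        # Draw independent uniform signs sigma_i in {-1, +1}.
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        # Supremum over the (finite) hypothesis set of the correlation with sigma.
        total += max(
            sum(s * v for s, v in zip(sigma, row)) for row in hypothesis_values
        )
    return total / (num_draws * m)
```

For a symmetric set such as $\{h, -h\}$, the inner maximum equals $|\sum_i \sigma_i h(x_i)|$, so the estimate is always non-negative.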


3. The Voted Kernel Regularization Algorithm 

In this section, we introduce the Voted Kernel Regularization algorithm. Let $K_1, \ldots, K_p$ be
$p$ positive semi-definite (PSD) kernel functions with $\kappa_k = \sup_{x \in \mathcal{X}} \sqrt{K_k(x, x)}$ for all $k \in [1, p]$.
We consider $p$ corresponding families of functions mapping from $\mathcal{X}$ to $\mathbb{R}$, $H_1, \ldots, H_p$, defined by
$H_k = \{x \mapsto \pm K_k(x, x') : x' \in \mathcal{X}\}$, where the sign accounts for the two possible ways of classifying a
point $x' \in \mathcal{X}$. The general form of a hypothesis $f$ returned by the algorithm is the following:

$$f = \sum_{j=1}^m \sum_{k=1}^p \alpha_{k,j} K_k(\cdot, x_j),$$

where $\alpha_{k,j} \in \mathbb{R}$ for all $j$ and $k$. Thus, $f$ is a linear combination of hypotheses in the sets $H_k$. This form,
with many coefficients $\alpha$ per point, is distinctly different from that of learning kernels, with only one $\alpha$ per
point. Since the families $H_k$ are symmetric, this linear combination can be made a non-negative
combination. Our algorithm consists of minimizing the Hinge loss on the training sample, as
with SVMs, but with a different regularization term that tends to penalize hypotheses drawn from
more complex $H_k$s more than those selected from simpler ones and to minimize the norm-1 of the
coefficients $\alpha_{k,j}$. Let $r_k$ denote the empirical Rademacher complexity of $H_k$: $r_k = \widehat{\mathfrak{R}}_S(H_k)$. Then,
the following is the objective function of Voted Kernel Regularization:

$$F(\alpha) = \frac{1}{m} \sum_{i=1}^m \max\Big(0,\, 1 - \sum_{j=1}^m \sum_{k=1}^p \alpha_{k,j}\, y_i y_j K_k(x_i, x_j)\Big)
+ \sum_{j=1}^m \sum_{k=1}^p (\lambda r_k + \beta)\, |\alpha_{k,j}|, \tag{1}$$

where $\lambda \ge 0$ and $\beta \ge 0$ are parameters of the algorithm. We will adopt the notation $\Lambda_k = \lambda r_k + \beta$
to simplify the presentation in what follows.
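To make the structure of objective (1) concrete, here is a minimal sketch that evaluates it for given coefficients. The function name `vkr_objective` and the nested-list layout (`alpha[k][j]`, `kernels[k][i][j]` $= K_k(x_i, x_j)$) are assumptions made for illustration; the complexities `r[k]` are taken as given inputs.

```python
def vkr_objective(alpha, kernels, y, r, lam, beta):
    """Evaluate the hinge-loss term plus the complexity-weighted L1 penalty.

    alpha[k][j]      : coefficient of K_k(., x_j)
    kernels[k][i][j] : K_k(x_i, x_j) on the training sample
    y[i]             : label in {-1, +1}
    r[k]             : complexity estimate of H_k (assumed precomputed)
    """
    m, p = len(y), len(kernels)
    hinge = 0.0
    for i in range(m):
        margin = sum(
            alpha[k][j] * y[i] * y[j] * kernels[k][i][j]
            for k in range(p)
            for j in range(m)
        )
        hinge += max(0.0, 1.0 - margin)
    # Each coefficient is penalized by Lambda_k = lam * r_k + beta.
    penalty = sum(
        (lam * r[k] + beta) * abs(alpha[k][j]) for k in range(p) for j in range(m)
    )
    return hinge / m + penalty
```

With all coefficients equal to zero, every hinge term equals 1 and the penalty vanishes, so the function returns 1.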

Note that the objective function $F$ is convex: the Hinge loss is convex, thus its composition with
an affine function is also convex, which shows that the first term is convex; the second term is
convex as a sum of absolute value terms with non-negative coefficients; and $F$ is convex as the sum of
these two convex terms. Thus, the optimization problem admits a global minimum. Voted Kernel
Regularization returns the function $f$ of the form given above with coefficients $\alpha = (\alpha_{k,j})_{k,j}$ minimizing $F$.

This formulation admits several benefits. First, it enables us to learn with very complex
hypothesis sets and yet not overfit, thanks to the Rademacher complexity-based penalties assigned to
coefficients associated with different $H_k$s. We will see later that the algorithm thereby defined
benefits from strong learning guarantees. Notice further that the penalties assigned are data-dependent,
which is a key feature of the algorithm. Second, observe that the objective function (1) does not
require the kernels to be positive-definite or even symmetric. The function $F$ is convex regardless
of the kernel properties. This is a significant benefit of the algorithm, which makes it possible to extend its
use beyond what algorithms such as SVMs require. In particular, some standard distances known
not to be PSD, such as the edit-distance and many others, could be used with this algorithm. Another
advantage of this algorithm compared to standard SVM and other $\ell_2$-regularized methods is that
the $\ell_1$-norm regularization used for Voted Kernel Regularization leads to sparse solutions. The solution
$\alpha$ is typically sparse, which significantly reduces prediction time and memory needs.

Note that hypotheses $h \in H_k$ are defined by $h(x) = K_k(x, x')$ where $x'$ is an arbitrary element
of the input space $\mathcal{X}$. However, our objective only includes those $x_j$ that belong to the observed
sample. In the case of a PDS kernel, there is no loss of generality in that, as we now
show. Indeed, observe that for $x' \in \mathcal{X}$ we can write $\Phi_k(x') = w + w^{\perp}$, where $\Phi_k$ is a feature map
associated with the kernel $K_k$ and where $w$ lies in the span of $\Phi_k(x_1), \ldots, \Phi_k(x_m)$ and $w^{\perp}$ is in the
orthogonal complement of this subspace. Therefore, writing $w = \sum_{j=1}^m c_j \Phi_k(x_j)$ for some
coefficients $c_j$, for any sample point $x_i$,

$$K_k(x_i, x') = \langle \Phi_k(x_i), \Phi_k(x') \rangle = \langle \Phi_k(x_i), w \rangle + \langle \Phi_k(x_i), w^{\perp} \rangle
= \sum_{j=1}^m c_j \langle \Phi_k(x_i), \Phi_k(x_j) \rangle = \sum_{j=1}^m c_j K_k(x_i, x_j),$$

which leads to objective (1). Note that selecting $-K_k(\cdot, x_j)$ with weight $\alpha_{k,j}$ is equivalent to
selecting $K_k(\cdot, x_j)$ with weight $-\alpha_{k,j}$, which accounts for the absolute value on the $\alpha_{k,j}$s in the
regularization term.

The Voted Kernel Regularization algorithm has some connections with other algorithms
previously described in the literature. In the absence of any regularization, that is $\lambda = 0$ and $\beta = 0$, it
reduces to the minimization of the Hinge loss and is therefore of course close to the SVM
algorithm [Cortes and Vapnik, 1995]. For $\lambda = 0$, that is when discarding our regularization based on
the different complexity of the hypothesis sets, the algorithm coincides with an algorithm originally
described by Vapnik [1998][pp. 426-427], later by several other authors starting with [Zhu et al.,
2003], and sometimes referred to as the norm-1 SVM.

4. Learning Guarantees 

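In this section, we provide strong data-dependent learning guarantees for the Voted Kernel
Regularization algorithm.

Let $\mathcal{F}$ denote $\mathrm{conv}(\bigcup_{k=1}^p H_k)$, that is the family of functions $f$ of the form $f = \sum_{t=1}^T \alpha_t h_t$,
where $\alpha = (\alpha_1, \ldots, \alpha_T)$ is in the simplex $\Delta$ and where, for each $t \in [1, T]$, $H_{k_t}$ denotes the
hypothesis set containing $h_t$, for some $k_t \in [1, p]$. Then, the following learning guarantee holds for
all $f \in \mathcal{F}$ [Cortes et al., 2014].

Theorem 1. Assume $p > 1$. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over
the choice of a sample $S$ of size $m$ drawn i.i.d. according to $\mathcal{D}^m$, the following inequality holds for
all $f = \sum_{t=1}^T \alpha_t h_t \in \mathcal{F}$:

$$R(f) \le \widehat{R}_{S,\rho}(f) + \frac{4}{\rho} \sum_{t=1}^T \alpha_t \mathfrak{R}_m(H_{k_t}) + \frac{2}{\rho} \sqrt{\frac{\log p}{m}}
+ \sqrt{\Big\lceil \frac{4}{\rho^2} \log\Big[\frac{\rho^2 m}{\log p}\Big] \Big\rceil \frac{\log p}{m} + \frac{\log \frac{2}{\delta}}{2m}}.$$

Thus, $R(f) \le \widehat{R}_{S,\rho}(f) + \frac{4}{\rho} \sum_{t=1}^T \alpha_t \mathfrak{R}_m(H_{k_t}) + O\Big(\sqrt{\frac{\log p}{\rho^2 m} \log\Big[\frac{\rho^2 m}{\log p}\Big]}\Big)$.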
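The complexity term of this bound is an $\alpha$-weighted average of the complexities of the families from which the $h_t$ are drawn, rather than the complexity of the most complex family. The short sketch below illustrates this point numerically; the function name `weighted_complexity` and the complexity values used in the usage note are hypothetical.

```python
def weighted_complexity(alpha, family_of_t, rademacher):
    """Return sum_t alpha_t * R_m(H_{k_t}) for alpha in the simplex.

    alpha[t]       : mixture weight of hypothesis h_t
    family_of_t[t] : index k_t of the family containing h_t
    rademacher[k]  : (estimate of) the Rademacher complexity of H_k
    """
    if any(a < 0 for a in alpha) or abs(sum(alpha) - 1.0) > 1e-9:
        raise ValueError("alpha must lie in the simplex")
    return sum(a * rademacher[k] for a, k in zip(alpha, family_of_t))
```

For instance, with hypothetical complexities 0.05 for a simple family and 0.5 for a complex one, placing weight 0.9 on the simple family gives a weighted complexity of 0.095, well below the 0.5 that a bound based on the most complex family would incur.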

Theorem 1 can be used to derive the Voted Kernel Regularization objective, and we provide full details
of this derivation in Appendix B. Furthermore, the results of Theorem 1 can be further improved using local
Rademacher complexity analysis, showing that faster rates of convergence are possible.


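Theorem 2. Assume $p > 1$. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over
the choice of a sample $S$ of size $m$ drawn i.i.d. according to $\mathcal{D}^m$, the following inequality holds for
all $f = \sum_{t=1}^T \alpha_t h_t \in \mathcal{F}$ and for any $K > 1$:

$$R(f) - \frac{K}{K-1} \widehat{R}_{S,\rho}(f) \le 6K \frac{1}{\rho} \sum_{t=1}^T \alpha_t \mathfrak{R}_m(H_{k_t})
+ 40 \frac{K}{\rho^2} \frac{\log p}{m} + 5K \frac{\log \frac{2}{\delta}}{m}
+ 5K \Big\lceil \frac{8}{\rho^2} \log \frac{\rho^2 \big(1 + \frac{K}{K-1}\big) m}{40 K \log p} \Big\rceil \frac{\log p}{m}.$$

Thus, for $K = 2$, $R(f) \le 2 \widehat{R}_{S,\rho}(f) + \frac{12}{\rho} \sum_{t=1}^T \alpha_t \mathfrak{R}_m(H_{k_t})
+ O\Big(\frac{\log p}{\rho^2 m} \log\Big(\frac{\rho^2 m}{\log p}\Big) + \frac{\log \frac{1}{\delta}}{m}\Big)$.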

The proof of this result is given in Appendix A. Note that $O(\log m / \sqrt{m})$ in Theorem 1 is
replaced with $O(\log m / m)$ in Theorem 2. For full hypothesis classes $H_k$, $\mathfrak{R}_m(H_k)$ may be on the
order of $O(1/\sqrt{m})$ and will dominate the bound. However, if we use localized classes
$H_k(r) = \{h \in H_k : \mathbb{E}[h^2] < r\}$, then for certain values of $r^*$ the local Rademacher complexities satisfy
$\mathfrak{R}_m(H_k(r^*)) \in O(1/m)$, leading to even stronger learning guarantees. Furthermore, this result leads to an
extension of the Voted Kernel Regularization objective:

$$F(\alpha) = \frac{1}{m} \sum_{i=1}^m \max\Big(0,\, 1 - \sum_{j=1}^m \sum_{k=1}^p \alpha_{k,j}\, y_i y_j K_k(x_i, x_j)\Big)
+ \sum_{j=1}^m \sum_{k=1}^p \big(\lambda \mathfrak{R}_m(H_k(s)) + \beta\big)\, |\alpha_{k,j}|, \tag{2}$$

which is optimized over $\alpha$, while the parameter $s$ is set via cross-validation. In Section 5.3, we provide
an explicit expression for the local Rademacher complexities of PDS kernel functions.


5. Optimization Solutions 

In this section, we propose two different algorithmic approaches to solve the optimization
problem (1): a linear programming (LP) and a coordinate descent (CD) approach.


5.1. Linear Programming (LP) formulation. This section presents a linear programming
approach for solving the Voted Kernel Regularization optimization problem (1). Observe that by
introducing slack variables $\xi_i$ the optimization can be equivalently written as follows:

$$\min_{\alpha,\, \xi \ge 0} \ \frac{1}{m} \sum_{i=1}^m \xi_i + \sum_{j=1}^m \sum_{k=1}^p \Lambda_k |\alpha_{k,j}|$$

$$\text{s.t.} \quad \xi_i \ge 1 - \sum_{j=1}^m \sum_{k=1}^p \alpha_{k,j}\, y_i y_j K_k(x_i, x_j), \quad \forall i \in [1, m].$$
j=l k=l 
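Next, we introduce new variables $\alpha^+_{k,j} \ge 0$ and $\alpha^-_{k,j} \ge 0$ such that $\alpha_{k,j} = \alpha^+_{k,j} - \alpha^-_{k,j}$. Then, for
any $k$ and $j$, $|\alpha_{k,j}|$ can be upper bounded as $|\alpha_{k,j}| \le \alpha^+_{k,j} + \alpha^-_{k,j}$. The optimization problem is therefore
equivalent to the following:

$$\min_{\alpha^+ \ge 0,\, \alpha^- \ge 0,\, \xi \ge 0} \ \frac{1}{m} \sum_{i=1}^m \xi_i + \sum_{j=1}^m \sum_{k=1}^p \Lambda_k (\alpha^+_{k,j} + \alpha^-_{k,j})$$

$$\text{s.t.} \quad \xi_i \ge 1 - \sum_{j=1}^m \sum_{k=1}^p (\alpha^+_{k,j} - \alpha^-_{k,j})\, y_i y_j K_k(x_i, x_j), \quad \forall i \in [1, m],$$

since conversely, a solution with $\alpha_{k,j} = \alpha^+_{k,j} - \alpha^-_{k,j}$ verifies the condition $\alpha^+_{k,j} = 0$ or $\alpha^-_{k,j} = 0$ for
any $k$ and $j$, thus $\alpha_{k,j} = \alpha^+_{k,j}$ when $\alpha_{k,j} \ge 0$ and $\alpha_{k,j} = -\alpha^-_{k,j}$ when $\alpha_{k,j} \le 0$. This is because if
$\delta = \min(\alpha^+_{k,j}, \alpha^-_{k,j}) > 0$, then replacing $\alpha^+_{k,j}$ with $\alpha^+_{k,j} - \delta$ and $\alpha^-_{k,j}$ with $\alpha^-_{k,j} - \delta$ would not affect
$\alpha^+_{k,j} - \alpha^-_{k,j}$ but would reduce $\alpha^+_{k,j} + \alpha^-_{k,j}$.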

Note that the resulting optimization problem is an LP problem since the objective function is
linear in the $\xi_i$s and in $\alpha^+$, $\alpha^-$, and since the constraints are affine. There is a battery of
well-established methods to solve this LP problem, including interior-point methods and the simplex
algorithm. An additional advantage of this formulation of the Voted Kernel Regularization
algorithm is that there is a large number of generic software packages for solving LPs, making the Voted
Kernel Regularization algorithm easier to implement.
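The LP with split variables can be assembled mechanically. The sketch below builds a cost vector and inequality constraints in the common form "minimize $c^\top z$ subject to $A z \le b$, $z \ge 0$", with $z = (\alpha^+, \alpha^-, \xi)$. The function name `build_vkr_lp`, the variable ordering and the nested-list kernel layout are assumptions made for this example; the resulting arrays can then be handed to any generic LP solver that accepts this form.

```python
def build_vkr_lp(kernels, y, Lambda):
    """Assemble (c, A_ub, b_ub) for z = (alpha_plus, alpha_minus, xi) >= 0.

    kernels[k][i][j] : K_k(x_i, x_j); y[i] in {-1, +1}; Lambda[k] = Lambda_k.
    Coefficient alpha_{k,j} is stored at index k * m + j.
    """
    m, p = len(y), len(kernels)
    # Cost: Lambda_k for alpha_plus and alpha_minus, 1/m for each slack xi_i.
    c = [Lambda[k] for k in range(p) for _ in range(m)] * 2 + [1.0 / m] * m
    A_ub, b_ub = [], []
    for i in range(m):
        g = [y[i] * y[j] * kernels[k][i][j] for k in range(p) for j in range(m)]
        slack = [0.0] * m
        slack[i] = -1.0
        # xi_i >= 1 - g.(a+ - a-)  <=>  -g.a+ + g.a- - xi_i <= -1
        A_ub.append([-v for v in g] + g + slack)
        b_ub.append(-1.0)
    return c, A_ub, b_ub
```

Non-negativity of all variables is left to the solver's bound settings, which keeps the constraint matrix at one row per training point.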

5.2. Coordinate Descent (CD) formulation. An alternative approach for solving the Voted
Kernel Regularization optimization problem (1) consists of using a coordinate descent method. The
advantage of such a formulation over the LP formulation is that there is no need to explicitly store
the whole vector of $\alpha$s, but rather only its non-zero entries. This enables learning with a very large
number of base hypotheses, including scenarios in which the number of base hypotheses is infinite.
The full description of the algorithm is given in Appendix C.
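Since Appendix C is not reproduced here, the following is only a generic sketch of cyclic coordinate descent on objective (1), not the authors' procedure. It relies on the fact that, along a single coordinate, the objective is convex and piecewise linear, so a minimizer can be found among its kinks (zero and the points where a hinge term becomes active). The names `objective` and `coordinate_descent` and the layout `G[c][i]` $= y_i y_j K_k(x_i, x_j)$ for coordinate $c = (k, j)$ are assumptions; coefficients are kept in a dictionary so that only non-zero entries are stored.

```python
def objective(alpha, G, Lam, m):
    """Objective (1) for sparse alpha: {coordinate index: value}."""
    margins = [sum(a * G[c][i] for c, a in alpha.items()) for i in range(m)]
    hinge = sum(max(0.0, 1.0 - mg) for mg in margins) / m
    return hinge + sum(Lam[c] * abs(a) for c, a in alpha.items())


def coordinate_descent(G, Lam, m, sweeps=5):
    """Cyclic coordinate descent with exact minimization along each coordinate."""
    alpha = {}  # sparse storage: zero coefficients are never kept
    for _ in range(sweeps):
        for c in range(len(G)):
            rest = {d: a for d, a in alpha.items() if d != c}
            margins = [sum(a * G[d][i] for d, a in rest.items()) for i in range(m)]
            # Kinks of the one-dimensional piecewise-linear function.
            candidates = [0.0] + [
                (1.0 - margins[i]) / G[c][i] for i in range(m) if G[c][i] != 0.0
            ]
            best = min(
                candidates, key=lambda a: objective({**rest, c: a}, G, Lam, m)
            )
            alpha = rest
            if best != 0.0:
                alpha[c] = best
    return alpha
```

Each coordinate update solves its one-dimensional problem exactly, so the objective value never increases from one update to the next.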

5.3. Complexity penalties. An additional benefit of the learning bounds presented in Section 4
is that they are data-dependent. They are based on the Rademacher complexities $r_k$ of the base
hypothesis sets $H_k$, which in some cases can be well estimated from the training sample. Our
formulation directly inherits this advantage. However, in certain cases computing or estimating the
complexities $r_1, \ldots, r_p$ may be costly. In this section, we discuss various upper bounds on these
complexities that can be used in practice for an efficient implementation of the Voted Kernel
Regularization algorithm.

Note that the hypothesis set $H_k = \{x \mapsto \pm K_k(x, x') : x' \in \mathcal{X}\}$ is of course distinct from the
RKHS $\mathbb{H}_k$ of the kernel $K_k$. Thus, we cannot use the known upper bound on $\widehat{\mathfrak{R}}_S(\mathbb{H}_k)$ to bound
$\widehat{\mathfrak{R}}_S(H_k)$. Nevertheless, our proof of the upper bound is similar and leads to a similar upper bound.

Lemma 3. Let $\mathbf{K}_k$ be the kernel matrix of the PDS kernel function $K_k$ for the sample $S$ and let
$\kappa_k = \sup_{x \in \mathcal{X}} \sqrt{K_k(x, x)}$. Then, the following inequality holds:

$$\widehat{\mathfrak{R}}_S(H_k) \le \frac{\kappa_k \sqrt{\mathrm{Tr}[\mathbf{K}_k]}}{m}.$$

We present the full proof of this result in Appendix A. Observe that the expression given by the 
lemma can be precomputed and used as the parameter of the optimization procedure. 
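As a sketch of how the Lemma 3 quantity $\kappa_k \sqrt{\mathrm{Tr}[\mathbf{K}_k]} / m$ might be precomputed from a kernel matrix, consider the helper below. The function name `trace_penalty` is hypothetical. Note also an assumption: when `kappa` is not supplied, the supremum over the whole input space is replaced by the maximum over the sample diagonal, which is only a proxy (it is exact for normalized kernels, where $K_k(x, x) = 1$ everywhere).

```python
import math


def trace_penalty(K, kappa=None):
    """Return kappa * sqrt(Tr[K]) / m for an m x m kernel matrix K.

    If kappa is None, it is approximated by the square root of the largest
    diagonal entry of K, a sample-based stand-in for sup_x sqrt(K(x, x)).
    """
    m = len(K)
    diagonal = [K[i][i] for i in range(m)]
    if kappa is None:
        kappa = math.sqrt(max(diagonal))
    return kappa * math.sqrt(sum(diagonal)) / m
```

For a normalized kernel the value reduces to $1/\sqrt{m}$ regardless of the kernel, which is why this penalty cannot separate different normalized kernels.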

The upper bound just derived is not fine enough to distinguish between different normalized
kernels since for any normalized kernel $K_k$, $\kappa_k = 1$ and $\mathrm{Tr}[\mathbf{K}_k] = m$. In that case, finer bounds in
terms of localized complexities can be used. In particular, the local Rademacher complexity of a set
of functions $H$ is defined as $\mathfrak{R}^{\mathrm{loc}}_m(H, r) = \mathfrak{R}_m(\{h \in H : \mathbb{E}[h^2] \le r\})$. If $(\lambda_j)_{j=1}^{\infty}$ is the sequence
of eigenvalues associated with the kernel $K_k$, then one can show [Mendelson, 2003, Bartlett et al.,
2005] that for every $r > 0$,

$$\mathfrak{R}^{\mathrm{loc}}_m(H_k, r) \le \sqrt{\frac{2}{m} \min_{\theta \ge 0} \Big(\theta r + \sum_{j > \theta} \lambda_j\Big)}
= \sqrt{\frac{2}{m} \sum_{j=1}^{\infty} \min(r, \lambda_j)}.$$

Furthermore, there is an absolute constant $c$ such that if $\lambda_1 \ge \frac{1}{m}$ then for every $r \ge \frac{1}{m}$,

$$\frac{c}{\sqrt{m}} \sqrt{\sum_{j=1}^{\infty} \min(r, \lambda_j)} \le \mathfrak{R}^{\mathrm{loc}}_m(H_k, r).$$

Note that taking $r = \infty$ recovers the earlier bound $\mathfrak{R}_m(H_k) \le \sqrt{\mathrm{Tr}[\mathbf{K}_k]}/m$. On the other hand, one
can show that, for instance, in the case of Gaussian kernels $\mathfrak{R}^{\mathrm{loc}}_m(H_k, r) = O\big(\sqrt{\tfrac{r}{m} \log(\tfrac{1}{r})}\big)$ and
using the fixed point of this function leads to $\mathfrak{R}^{\mathrm{loc}}_m(H_k, r^*) = O\big(\tfrac{\log m}{m}\big)$. These results can be used in
conjunction with the local Rademacher complexity extension of Voted Kernel Regularization
discussed in Section 4.
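To illustrate the eigenvalue-based upper bound $\sqrt{\tfrac{2}{m} \sum_j \min(r, \lambda_j)}$ on the local Rademacher complexity [Mendelson, 2003, Bartlett et al., 2005], the sketch below evaluates it for a finite list of eigenvalues. The function name `local_rademacher_bound` and the truncation of the spectrum to a finite list are assumptions made for this example.

```python
import math


def local_rademacher_bound(eigenvalues, r, m):
    """Return sqrt((2 / m) * sum_j min(r, lambda_j)) for a truncated spectrum."""
    return math.sqrt(2.0 / m * sum(min(r, lam) for lam in eigenvalues))
```

Eigenvalues larger than $r$ contribute only $r$ each, so a fast-decaying spectrum yields a much smaller value than the trace-based bound, which corresponds to the case where $r$ exceeds every eigenvalue.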

If all of the kernels belong to the same family such as, for example, polynomial or Gaussian
kernels, it may be desirable to use measures of complexity that account for specific properties
of the given family of kernels, such as the polynomial degree or the bandwidth of the Gaussian. Below we
discuss several additional upper bounds that aim to address these questions.

For instance, if $K_k$ is a polynomial kernel of degree $k$, then we can use an upper bound on the
Rademacher complexity of $H_k$ in terms of the square-root of its pseudo-dimension $\mathrm{Pdim}(H_k)$,
which coincides with the dimension $d_k$ of the feature space corresponding to a polynomial kernel
of degree $k$ over inputs of dimension $N$, which is given by

$$d_k = \binom{N + k}{k} \le \frac{(N + k)^k}{k!} \le \Big(\frac{(N + k) e}{k}\Big)^k. \tag{3}$$

Lemma 4. Let $K_k$ be a polynomial kernel of degree $k$. Then, the empirical Rademacher complexity
of $H_k$ can be upper bounded in terms of $\kappa_k$ and $\sqrt{d_k}$.

The proof of this result is in Appendix A. Thus, in view of the lemma, we can use $r_k = \kappa_k^2 \sqrt{d_k}$
as a complexity penalty in the formulation of the Voted Kernel Regularization algorithm with
polynomial kernels, with $d_k$ given by the expression (3).
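As a small worked example of expression (3), the sketch below computes the feature-space dimension $d_k = \binom{N+k}{k}$ of a degree-$k$ polynomial kernel on $N$-dimensional inputs together with its two upper bounds. The function names are hypothetical, and $N$ denotes the input dimension, which is an interpretation of the symbol used in (3).

```python
import math


def poly_feature_dim(N, k):
    """Dimension d_k = C(N + k, k) of the degree-k polynomial feature space."""
    return math.comb(N + k, k)


def poly_feature_dim_bounds(N, k):
    """Return the two upper bounds of (3): (N+k)^k / k! and ((N+k)e/k)^k."""
    return (N + k) ** k / math.factorial(k), ((N + k) * math.e / k) ** k
```

The dimension grows quickly with the degree, which is what makes a degree-dependent penalty on the coefficients meaningful for high-degree kernels.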


6. Experiments 

We experimented with several benchmark datasets from the UCI repository, specifically breastcancer,
climate, diabetes, german(numeric), ionosphere, musk, ocr49, phishing, retinopathy,
vertebral and waveform01. Here, ocr49 refers to the subset of the OCR dataset with classes
4 and 9, and similarly waveform01 refers to the subset of the waveform dataset with classes 0 and 1.
More details on all the datasets are given in Table 2 in Appendix D.

Our experiments compared Voted Kernel Regularization to regular SVM, which we refer to as
L2-SVM, and to norm-1 SVM, called L1-SVM. In all of our experiments, we used lp_solve,
an off-the-shelf LP solver, to solve the Voted Kernel Regularization and L1-SVM optimization
problems. For L2-SVM, we used LibSVM.

               Error (%)                                                   Number of support vectors
Dataset        L2-SVM         L1-SVM         VKR2           VKRc           L2-SVM         L1-SVM         VKR2           VKRc
ocr49          5.05 (0.65)    3.50 (0.85)    2.70 (0.97)    3.50 (0.85)    449.8 (3.6)    140.0 (3.6)    6.8 (1.3)      164.6 (9.5)
phishing       4.64 (1.38)    4.11 (0.71)    3.62 (0.44)    3.87 (0.80)    221.4 (15.1)   188.8 (7.5)    73.0 (3.2)     251.8 (4.0)
waveform01     8.38 (0.63)    8.47 (0.52)    8.41 (0.97)    8.57 (0.58)    415.6 (8.1)    13.6 (1.3)     18.4 (1.5)     14.6 (2.3)
breastcancer   11.45 (0.74)   12.60 (2.88)   11.73 (2.73)   11.30 (1.31)   83.8 (10.9)    46.4 (2.4)     66.6 (3.9)     29.4 (1.9)
german         23.00 (3.00)   22.40 (2.58)   24.10 (2.99)   24.20 (2.61)   357.2 (16.7)   34.4 (2.2)     25.0 (1.4)     30.2 (2.3)
ionosphere     6.54 (3.07)    7.12 (3.18)    4.27 (2.00)    3.99 (2.12)    152.0 (5.5)    73.8 (4.9)     43.6 (2.9)     30.6 (1.8)
pima           31.90 (1.17)   30.85 (1.54)   31.77 (2.68)   30.73 (1.46)   330.0 (6.6)    26.4 (0.6)     33.8 (3.6)     40.6 (1.1)
musk           15.34 (2.23)   11.55 (1.49)   10.71 (1.13)   9.03 (1.39)    251.8 (12.4)   115.4 (4.5)    125.6 (8.0)    108.0 (5.2)
retinopathy    24.58 (2.28)   24.85 (2.65)   25.46 (2.08)   24.06 (2.43)   648.2 (21.3)   42.6 (3.7)     43.6 (4.0)     48.0 (3.1)
climate        5.19 (2.41)    5.93 (2.83)    5.56 (2.85)    6.30 (2.89)    66.0 (4.6)     19.0 (0.0)     51.0 (6.7)     18.6 (0.9)
vertebral      17.74 (6.35)   18.06 (5.51)   17.10 (7.27)   17.10 (6.99)   75.4 (4.0)     4.4 (0.6)      9.6 (1.1)      8.2 (1.3)

Table 1. Experimental results with Voted Kernel Regularization and polynomial
kernels. Each entry is reported as mean (standard deviation). VKRc refers to the algorithm
obtained by using Lemma 3 as complexity measure, while VKR2 refers to the algorithm obtained
by using Lemma 4. Indicated in boldface are results where the errors obtained are statistically
significant at a confidence level of 5%. In italics are results that are better at 10% level.

In each of the experiments, we used standard 5-fold cross-validation for performance evaluation
and model selection. In particular, each dataset was randomly partitioned into 5 folds, and each
algorithm was run 5 times, with a different assignment of folds to the training set, validation set
and test set for each run. Specifically, for each $i \in \{0, \ldots, 4\}$, fold $i$ was used for testing, fold
$i + 1 \pmod 5$ was used for validation, and the remaining folds were used for training. For each
setting of the parameters, we computed the average validation error across the 5 folds, and selected
the parameter setting with minimum average validation error. The average test error across the 5 folds
was then computed for this particular parameter setting.
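The fold rotation used in this protocol (fold $i$ for testing, fold $i + 1 \pmod 5$ for validation, the rest for training) can be written down in a few lines. The function name `fold_roles` and the dictionary layout are assumptions made for this illustration.

```python
def fold_roles(num_folds=5):
    """List, for each run i, which folds serve as test, validation and training."""
    splits = []
    for i in range(num_folds):
        test = i
        validation = (i + 1) % num_folds
        train = [f for f in range(num_folds) if f not in (test, validation)]
        splits.append({"test": test, "validation": validation, "train": train})
    return splits
```

Across the five runs, every fold is used exactly once for testing and exactly once for validation.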

In the first set of experiments we used polynomial kernels of the form $K_k(x, y) = (x^\top y + 1)^k$.
We report the results in Table 1. For Voted Kernel Regularization, we optimized over
$\lambda \in \{10^{-i} : i = 0, \ldots, 6\}$ and $\beta \in \{10^{-i} : i = 0, \ldots, 6\}$. The family of kernel functions $H_k$ for
$k \in [1, 10]$ was chosen to be the set of polynomial kernels of degree $k$. In our experiments we compared
the bounds of both Lemma 3 and Lemma 4 used as an estimate of the Rademacher complexity. For
L1-SVM, we cross-validated over degrees in the range 1 through 10 and $\beta$ in the same range as for
Voted Kernel Regularization. Cross-validation for L2-SVM was also done over the degree and the
regularization parameter $C \in \{10^{i} : i = -4, \ldots, 7\}$.
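For concreteness, the polynomial kernel and the hyperparameter grids of this first experiment can be sketched as follows. The names `poly_kernel`, `lambdas`, `betas`, `degrees` and `grid` are illustrative only, and the sketch covers only the kernel evaluation and the search grid, not the training procedure.

```python
def poly_kernel(x, y, k):
    """Polynomial kernel (x . y + 1)^k for two equal-length feature vectors."""
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** k


# Search grids for the two regularization parameters and the kernel degrees.
lambdas = [10.0 ** -i for i in range(7)]   # 10^0, ..., 10^-6
betas = [10.0 ** -i for i in range(7)]     # 10^0, ..., 10^-6
degrees = list(range(1, 11))               # polynomial degrees 1 through 10
grid = [(lam, beta) for lam in lambdas for beta in betas]
```

Each pair in `grid` corresponds to one parameter setting whose validation error is averaged over the folds during model selection.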

On 5 out of 11 datasets Voted Kernel Regularization outperformed L2-SVM and L1-SVM, with
a considerable improvement on 3 datasets. On the rest of the datasets, there was no statistical
difference between these algorithms. Note that our results are also consistent with previous
studies that indicated that L1-SVM and L2-SVM often have comparable performance. Observe that
solutions obtained by Voted Kernel Regularization are often up to 10 times sparser than those of
L2-SVM. In other words, Voted Kernel Regularization has the benefit of sparse solutions and often
an improved performance, which provides strong empirical evidence in support of our
formulation.

In a second set of experiments we used families of Gaussian kernels based on distinct
values of the parameter $\gamma \in \{10^{i} : i = -6, \ldots, 0\}$. We used the bound of Lemma 3 as an estimate
of the Rademacher complexity. In our cross-validation we used the same range for the $\lambda$ and $\beta$
parameters of the Voted Kernel Regularization and L1-SVM algorithms. For L2-SVM we increased the
range of the regularization parameter: $C \in \{10^{i} : i = -4, \ldots, 7\}$. The results of our experiments
are comparable to the results with polynomial kernels; however, the improvements obtained by Voted
Kernel Regularization are not always as significant in this case. The sparsity of the solutions is
comparable to that observed with polynomial kernels.

7. Conclusion 

In this paper we presented a new support vector algorithm, Voted Kernel Regularization. Our
algorithm benefits from strong data-dependent learning guarantees that enable learning with highly
complex feature maps without overfitting. We further improved these learning guarantees using
local complexity analysis, leading to an extension of the Voted Kernel Regularization algorithm. The
key ingredient of our algorithm is a new regularization term that makes use of the Rademacher
complexities of the different families of kernel functions used by the Voted Kernel Regularization
algorithm. We provided a thorough analysis of several different alternatives that can be used to
approximate these complexities. We also provided two practical implementations of our algorithm based on linear
programming and coordinate descent. Finally, we presented results of extensive experiments that
show that our algorithm always finds solutions that are much sparser than those of the other support
vector algorithms and at the same time often outperforms other formulations.
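Appendix A. Proofs of Learning Guarantees


Theorem 2. Assume $p > 1$. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over
the choice of a sample $S$ of size $m$ drawn i.i.d. according to $\mathcal{D}^m$, the following inequality holds for
all $f = \sum_{t=1}^T \alpha_t h_t \in \mathcal{F}$ and for any $K > 1$:

$$R(f) - \frac{K}{K-1} \widehat{R}_{S,\rho}(f) \le 6K \frac{1}{\rho} \sum_{t=1}^T \alpha_t \mathfrak{R}_m(H_{k_t})
+ 40 \frac{K}{\rho^2} \frac{\log p}{m} + 5K \frac{\log \frac{2}{\delta}}{m}
+ 5K \Big\lceil \frac{8}{\rho^2} \log \frac{\rho^2 \big(1 + \frac{K}{K-1}\big) m}{40 K \log p} \Big\rceil \frac{\log p}{m}.$$

Thus, for $K = 2$, $R(f) \le 2 \widehat{R}_{S,\rho}(f) + \frac{12}{\rho} \sum_{t=1}^T \alpha_t \mathfrak{R}_m(H_{k_t})
+ O\Big(\frac{\log p}{\rho^2 m} \log\Big(\frac{\rho^2 m}{\log p}\Big) + \frac{\log \frac{1}{\delta}}{m}\Big)$.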
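Proof. For a fixed $\mathbf{h} = (h_1, \ldots, h_T)$, any $\alpha \in \Delta$ defines a distribution over $\{h_1, \ldots, h_T\}$.
Sampling from $\{h_1, \ldots, h_T\}$ according to $\alpha$ and averaging leads to functions $g$ of the form
$g = \frac{1}{n} \sum_{t=1}^T n_t h_t$ for some $\mathbf{n} = (n_1, \ldots, n_T)$, with $\sum_{t=1}^T n_t = n$ and $h_t \in H_{k_t}$.

For any $\mathbf{N} = (N_1, \ldots, N_p)$ with $|\mathbf{N}| = n$, we consider the family of functions

$$G_{\mathcal{F},\mathbf{N}} = \Big\{ \frac{1}{n} \sum_{k=1}^p \sum_{j=1}^{N_k} h_{k,j} \ \Big|\ \forall (k, j) \in [p] \times [N_k],\ h_{k,j} \in H_k \Big\},$$

and the union of all such families $G_{\mathcal{F},n} = \bigcup_{|\mathbf{N}| = n} G_{\mathcal{F},\mathbf{N}}$. Fix $\rho > 0$. We define the class
$\Phi \circ G_{\mathcal{F},\mathbf{N}} = \{\Phi_\rho(g) : g \in G_{\mathcal{F},\mathbf{N}}\}$ and
$G_r = G_{\Phi,\mathcal{F},\mathbf{N},r} = \{ r \ell_g / \max(r, \mathbb{E}[\ell_g]) : \ell_g \in \Phi \circ G_{\mathcal{F},\mathbf{N}} \}$ for $r$ to be chosen
later. Observe that for $v_g \in G_{\Phi,\mathcal{F},\mathbf{N},r}$, $\mathrm{Var}[v_g] \le r$. Indeed, if $r > \mathbb{E}[\ell_g]$ then $v_g = \ell_g$. Otherwise,
$\mathrm{Var}[v_g] = r^2\, \mathrm{Var}[\ell_g] / (\mathbb{E}[\ell_g])^2 \le r\, \mathbb{E}[\ell_g^2] / \mathbb{E}[\ell_g] \le r$.

By Theorem 2.1 in Bartlett et al. [2005], for any $\delta > 0$, with probability at least $1 - \delta$, for any
$0 < \beta < 1$,

$$V \le 2(1 + \beta)\, \mathfrak{R}_m(G_{\Phi,\mathcal{F},\mathbf{N},r}) + \sqrt{\frac{2 r \log \frac{1}{\delta}}{m}} + \Big(\frac{1}{3} + \frac{1}{\beta}\Big) \frac{\log \frac{1}{\delta}}{m},$$

where $V = \sup_{v \in G_r} (\mathbb{E}[v] - \mathbb{E}_n[v])$ and $\beta$ is a free parameter. Next we observe that
$\mathfrak{R}_m(G_{\Phi,\mathcal{F},\mathbf{N},r}) \le \mathfrak{R}_m(\{a \ell_g : \ell_g \in \Phi \circ G_{\mathcal{F},\mathbf{N}},\ a \in [0, 1]\}) = \mathfrak{R}_m(\Phi \circ G_{\mathcal{F},\mathbf{N}})$.
Therefore, using Talagrand's contraction lemma and convexity, we have that
$\mathfrak{R}_m(G_{\Phi,\mathcal{F},\mathbf{N},r}) \le \frac{1}{\rho} \sum_{k=1}^p \frac{N_k}{n} \mathfrak{R}_m(H_k)$. It follows that for any
$\delta > 0$, with probability at least $1 - \delta$, for all $0 < \beta < 1$,

$$V \le 2(1 + \beta) \frac{1}{\rho} \sum_{k=1}^p \frac{N_k}{n} \mathfrak{R}_m(H_k) + \sqrt{\frac{2 r \log \frac{1}{\delta}}{m}} + \Big(\frac{1}{3} + \frac{1}{\beta}\Big) \frac{\log \frac{1}{\delta}}{m}.$$

Since there are at most $p^n$ possible $p$-tuples $\mathbf{N}$ with $|\mathbf{N}| = n$, by the union bound, for any $\delta > 0$,
with probability at least $1 - \delta$,

$$V \le 2(1 + \beta) \frac{1}{\rho} \sum_{k=1}^p \frac{N_k}{n} \mathfrak{R}_m(H_k) + \sqrt{\frac{2 r \log \frac{p^n}{\delta}}{m}} + \Big(\frac{1}{3} + \frac{1}{\beta}\Big) \frac{\log \frac{p^n}{\delta}}{m}.$$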
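Thus, with probability at least $1 - \delta$, for all functions $g = \frac{1}{n} \sum_{t=1}^T n_t h_t$ with $h_t \in H_{k_t}$, the following
inequality holds:

$$V \le 2(1 + \beta) \frac{1}{\rho} \sum_{t=1}^T \frac{n_t}{n} \mathfrak{R}_m(H_{k_t}) + \sqrt{\frac{2 r \log \frac{p^n}{\delta}}{m}} + \Big(\frac{1}{3} + \frac{1}{\beta}\Big) \frac{\log \frac{p^n}{\delta}}{m}.$$

Taking the expectation with respect to $\alpha$ and using $\mathbb{E}_\alpha[n_t / n] = \alpha_t$, we obtain that for any $\delta > 0$,
with probability at least $1 - \delta$, for all $\mathbf{h}$, we can write

$$\mathop{\mathbb{E}}_\alpha[V] \le 2(1 + \beta) \frac{1}{\rho} \sum_{t=1}^T \alpha_t \mathfrak{R}_m(H_{k_t}) + \sqrt{\frac{2 r \log \frac{p^n}{\delta}}{m}} + \Big(\frac{1}{3} + \frac{1}{\beta}\Big) \frac{\log \frac{p^n}{\delta}}{m}.$$

We now show that $r$ can be chosen in such a way that $\mathbb{E}_\alpha[V] \le r / K$. The right-hand side of the
above bound is of the form $A \sqrt{r} + C$. Note that the solution of $r / K = C + A \sqrt{r}$ is bounded by
$K^2 A^2 + 2KC$ and hence, by Lemma 5 in [Bartlett et al., 2002], the following bound holds:

$$\mathop{\mathbb{E}}_\alpha\Big[R_{\rho/2}(g) - \frac{K}{K-1} \widehat{R}_{S,\rho}(g)\Big] \le 4K(1 + \beta) \frac{1}{\rho} \sum_{t=1}^T \alpha_t \mathfrak{R}_m(H_{k_t})
+ \Big(2K^2 + 2K\Big(\frac{1}{3} + \frac{1}{\beta}\Big)\Big) \frac{\log \frac{p^n}{\delta}}{m}.$$

Set $\beta = 1/2$; then we have that

$$\mathop{\mathbb{E}}_\alpha\Big[R_{\rho/2}(g) - \frac{K}{K-1} \widehat{R}_{S,\rho}(g)\Big] \le 6K \frac{1}{\rho} \sum_{t=1}^T \alpha_t \mathfrak{R}_m(H_{k_t}) + 5K \frac{\log \frac{p^n}{\delta}}{m}.$$

Then, for any $\delta_n > 0$, with probability at least $1 - \delta_n$,

$$\mathop{\mathbb{E}}_\alpha\Big[R_{\rho/2}(g) - \frac{K}{K-1} \widehat{R}_{S,\rho}(g)\Big] \le 6K \frac{1}{\rho} \sum_{t=1}^T \alpha_t \mathfrak{R}_m(H_{k_t}) + 5K \frac{\log \frac{p^n}{\delta_n}}{m}.$$

Choose $\delta_n = \frac{\delta}{2 p^{n-1}}$ for some $\delta > 0$; then, for $p \ge 2$, $\sum_{n \ge 1} \delta_n = \frac{\delta}{2(1 - 1/p)} \le \delta$. Thus, for any $\delta > 0$
and any $n \ge 1$, with probability at least $1 - \delta$, the following holds for all $\mathbf{h}$:

$$\mathop{\mathbb{E}}_\alpha\Big[R_{\rho/2}(g) - \frac{K}{K-1} \widehat{R}_{S,\rho}(g)\Big] \le 6K \frac{1}{\rho} \sum_{t=1}^T \alpha_t \mathfrak{R}_m(H_{k_t}) + 5K \frac{\log \frac{2 p^{2n-1}}{\delta}}{m}. \tag{4}$$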

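Now, for any $f = \sum_{t=1}^T \alpha_t h_t \in \mathcal{F}$ and any $g = \frac{1}{n} \sum_{t=1}^T n_t h_t$, we can upper bound
$R(f) = \Pr_{(x,y) \sim \mathcal{D}}[y f(x) \le 0]$, the generalization error of $f$, as follows:

$$R(f) = \Pr_{(x,y) \sim \mathcal{D}}[y f(x) - y g(x) + y g(x) \le 0] \le \Pr[y f(x) - y g(x) < -\rho/2] + \Pr[y g(x) \le \rho/2]
= \Pr[y f(x) - y g(x) < -\rho/2] + R_{\rho/2}(g).$$

We can also write

$$\widehat{R}_{S,\rho}(g) = \widehat{R}_{S,\rho}(g - f + f) \le \Pr_{(x,y) \sim S}[y g(x) - y f(x) < -\rho/2] + \widehat{R}_{S,3\rho/2}(f).$$

Combining these inequalities yields

$$\Pr_{(x,y) \sim \mathcal{D}}[y f(x) \le 0] - \frac{K}{K-1} \widehat{R}_{S,3\rho/2}(f) \le \Pr_{(x,y) \sim \mathcal{D}}[y f(x) - y g(x) < -\rho/2]
+ \frac{K}{K-1} \Pr_{(x,y) \sim S}[y g(x) - y f(x) < -\rho/2] + R_{\rho/2}(g) - \frac{K}{K-1} \widehat{R}_{S,\rho}(g).$$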
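Taking the expectation with respect to $\alpha$ yields

\[
\begin{aligned}
R(f) - \frac{K}{K-1}\widehat{R}_{S,3\rho/2}(f)
&\le \mathbb{E}_{x\sim\mathcal{D},\,\alpha}\big[1_{yf(x) - yg(x) < -\rho/2}\big]
+ \frac{K}{K-1}\,\mathbb{E}_{x\sim S,\,\alpha}\big[1_{yg(x) - yf(x) < -\rho/2}\big] \\
&\quad + \mathbb{E}_\alpha\Big[R_{\rho/2}(g) - \frac{K}{K-1}\widehat{R}_{S,\rho}(g)\Big].
\end{aligned}
\]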

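Since $f = \mathbb{E}_\alpha[g]$, by Hoeffding's inequality, for any $x$,

\[
\begin{aligned}
\mathbb{E}_\alpha\big[1_{yf(x) - yg(x) < -\rho/2}\big] &= \Pr_\alpha[yf(x) - yg(x) < -\rho/2] \le e^{-n\rho^2/8}, \\
\mathbb{E}_\alpha\big[1_{yg(x) - yf(x) < -\rho/2}\big] &= \Pr_\alpha[yg(x) - yf(x) < -\rho/2] \le e^{-n\rho^2/8}.
\end{aligned}
\]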
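
The Hoeffding step can be illustrated by simulation. The sketch below is an assumption-laden toy, not the authors' experiment: at one fixed point $x$ it takes three hypothetical base predictions $h_t(x) \in [-1, 1]$ with mixture weights $\alpha$, draws $n$ hypotheses i.i.d. from $\alpha$ to form $g$, and compares the empirical frequency of the event $yf(x) - yg(x) < -\rho/2$ with $e^{-n\rho^2/8}$, the Hoeffding bound for an average of $n$ variables taking values in $[-1, 1]$.

```python
import math
import random

def deviation_frequency(alpha, hyp_values, y, rho, n, trials, rng):
    """Estimate Pr_alpha[y f(x) - y g(x) < -rho/2] at one fixed point x.

    alpha: mixture weights summing to 1; hyp_values: h_t(x), each in [-1, 1].
    f is the alpha-weighted mean; g averages n hypotheses drawn i.i.d. from alpha.
    """
    f = sum(a * h for a, h in zip(alpha, hyp_values))
    hits = 0
    for _ in range(trials):
        draws = rng.choices(hyp_values, weights=alpha, k=n)
        g = sum(draws) / n
        if y * f - y * g < -rho / 2:
            hits += 1
    return hits / trials

rng = random.Random(0)
alpha = [0.5, 0.3, 0.2]         # hypothetical mixture weights
hyp_values = [1.0, -1.0, 0.4]   # hypothetical predictions h_t(x)
rho, n = 0.8, 25
freq = deviation_frequency(alpha, hyp_values, +1, rho, n, 20000, rng)
bound = math.exp(-n * rho ** 2 / 8)
print(f"empirical frequency {freq:.4f} <= Hoeffding bound {bound:.4f}")
```

The bound is loose for this toy configuration, as expected: Hoeffding's inequality only uses the range of the variables, not their variance.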

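Thus, for any fixed $f \in \mathcal{F}$, we can write

\[
R(f) - \frac{K}{K-1}\widehat{R}_{S,3\rho/2}(f)
\le \Big(1 + \frac{K}{K-1}\Big)e^{-n\rho^2/8}
+ \mathbb{E}_\alpha\Big[R_{\rho/2}(g) - \frac{K}{K-1}\widehat{R}_{S,\rho}(g)\Big].
\]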


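Thus, the following inequality holds:

\[
\sup_{f \in \mathcal{F}}\Big(R(f) - \frac{K}{K-1}\widehat{R}_{S,3\rho/2}(f)\Big)
\le \Big(1 + \frac{K}{K-1}\Big)e^{-n\rho^2/8}
+ \sup_{h}\,\mathbb{E}_\alpha\Big[R_{\rho/2}(g) - \frac{K}{K-1}\widehat{R}_{S,\rho}(g)\Big].
\]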


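Therefore, in view of (4), for any $\delta > 0$ and any $n \ge 1$, with probability at least $1 - \delta$, the following
holds for all $f \in \mathcal{F}$:

\[
R(f) - \frac{K}{K-1}\widehat{R}_{S,3\rho/2}(f)
\le \Big(1 + \frac{K}{K-1}\Big)e^{-n\rho^2/8}
+ 6K\,\frac{1}{\rho}\sum_{t=1}^{T}\alpha_t\,\mathfrak{R}_m(H_{k_t})
+ 5K\,\frac{\log\frac{2p^{2n-1}}{\delta}}{m}.
\]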


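To conclude the proof, we optimize over $n$ the function $n \mapsto v_1 e^{-nu} + v_2 n$, which leads to $n = \frac{1}{u}\log\frac{u v_1}{v_2}$.
Therefore, we set

\[
n = \Big\lceil \frac{8}{\rho^2}\log\Big(\frac{\rho^2\big(1 + \frac{K}{K-1}\big)m}{40K\log p}\Big)\Big\rceil
\]

to obtain the following bound:

\[
\begin{aligned}
R(f) - \frac{K}{K-1}\widehat{R}_{S,3\rho/2}(f)
&\le 6K\,\frac{1}{\rho}\sum_{t=1}^{T}\alpha_t\,\mathfrak{R}_m(H_{k_t})
+ 40\,\frac{K}{\rho^2}\,\frac{\log p}{m}
+ 5K\,\frac{\log\frac{2}{\delta}}{m} \\
&\quad + 5K\,\Big\lceil \frac{8}{\rho^2}\log\Big(\frac{\rho^2\big(1 + \frac{K}{K-1}\big)m}{40K\log p}\Big)\Big\rceil\frac{\log p}{m}.
\end{aligned}
\]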
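
The choice of $n$ comes from balancing an exponentially decaying term against a linearly growing one. The sketch below illustrates that trade-off under stated assumptions: it takes $u = \rho^2/8$, $v_1 = 1 + K/(K-1)$ and $v_2 = 5K\log p/m$, which is one reading of the constants rather than values confirmed by the text, and the numeric settings for $K$, $\rho$, $m$ and $p$ are hypothetical.

```python
import math

def tradeoff(n, v1, v2, u):
    """The function n -> v1 * exp(-n * u) + v2 * n being minimized."""
    return v1 * math.exp(-n * u) + v2 * n

def continuous_minimizer(v1, v2, u):
    """Stationary point: -u * v1 * exp(-n * u) + v2 = 0, so n = log(u * v1 / v2) / u."""
    return math.log(u * v1 / v2) / u

K, rho, m, p = 2.0, 0.5, 100000, 10   # hypothetical settings
u = rho ** 2 / 8                      # exponent rate from the Hoeffding step
v1 = 1 + K / (K - 1)                  # weight of the exponential term
v2 = 5 * K * math.log(p) / m          # assumed per-unit cost of increasing n

n_star = continuous_minimizer(v1, v2, u)
n_int = math.ceil(n_star)             # n must be an integer, so round up

# The stationary point is a minimum of the convex trade-off.
assert tradeoff(n_star, v1, v2, u) <= tradeoff(n_star - 1, v1, v2, u)
assert tradeoff(n_star, v1, v2, u) <= tradeoff(n_star + 1, v1, v2, u)
# Rounding up keeps the exponential term below v2 / u = 40 K log(p) / (rho^2 m).
assert v1 * math.exp(-n_int * u) <= v2 / u
print(n_star, n_int, v2 / u)
```

Rounding $n$ up rather than down is what guarantees that the exponential term is at most $v_2/u$, which is where a term of order $K\log p/(\rho^2 m)$ in the final bound originates.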


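Thus, taking $K = 2$ simply yields

\[
R(f) \le 2\widehat{R}_{S,3\rho/2}(f) + \frac{12}{\rho}\sum_{t=1}^{T}\alpha_t\,\mathfrak{R}_m(H_{k_t})
+ O\Big(\frac{\log p}{\rho^2 m}\log\Big(\frac{\rho^2 m}{\log p}\Big) + \frac{\log\frac{1}{\delta}}{m}\Big),
\]

and the proof is complete.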


□ 


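Lemma 3. Let $\mathbf{K}_k$ be the kernel matrix of the kernel function $K_k$ for the sample $S$ and let $\kappa_k =
\sup_{x \in \mathcal{X}} \sqrt{K_k(x, x)}$. Then, the following inequality holds:

\[
\widehat{\mathfrak{R}}_S(H_k) \le \frac{\kappa_k\sqrt{\operatorname{Tr}[\mathbf{K}_k]}}{m}.
\]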

































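Proof. $\widehat{\mathfrak{R}}_S(H_k)$ can be upper bounded as follows using the Cauchy-Schwarz inequality:

\[
\begin{aligned}
\widehat{\mathfrak{R}}_S(H_k)
&= \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{x' \in \mathcal{X},\, s \in \{-1,+1\}} \sum_{i=1}^{m} \sigma_i\, s\, K_k(x_i, x')\Big]
 = \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{x' \in \mathcal{X}} \Big|\sum_{i=1}^{m} \sigma_i K_k(x_i, x')\Big|\Big] \\
&= \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{x' \in \mathcal{X}} \Big|\Phi_k(x') \cdot \sum_{i=1}^{m} \sigma_i \Phi_k(x_i)\Big|\Big] \\
&\le \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{x' \in \mathcal{X}} \|\Phi_k(x')\|_{\mathbb{H}_k}\,\Big\|\sum_{i=1}^{m} \sigma_i \Phi_k(x_i)\Big\|_{\mathbb{H}_k}\Big] \\
&= \frac{\kappa_k}{m}\,\mathbb{E}_\sigma\Big[\sqrt{\sum_{i,j=1}^{m} \sigma_i\sigma_j\, \Phi_k(x_i) \cdot \Phi_k(x_j)}\Big] \\
&\le \frac{\kappa_k}{m}\sqrt{\mathbb{E}_\sigma\Big[\sum_{i,j=1}^{m} \sigma_i\sigma_j\, \Phi_k(x_i) \cdot \Phi_k(x_j)\Big]}
 = \frac{\kappa_k\sqrt{\operatorname{Tr}[\mathbf{K}_k]}}{m},
\end{aligned}
\]

where we used in the last line Jensen's inequality.

□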
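
The trace bound of Lemma 3 can be sanity-checked on synthetic data. The sketch below is an illustration under explicit assumptions, not part of the paper: the input space is taken to be the unit sphere, the kernel is the hypothetical choice $K_k(x, x') = (1 + x \cdot x')^k$ (so that $\kappa_k = 2^{k/2}$), and the supremum over $x' \in \mathcal{X}$ is restricted to the sample points. That restriction can only lower the quantity, so the Monte Carlo estimate should fall below $\kappa_k\sqrt{\operatorname{Tr}[\mathbf{K}_k]}/m$.

```python
import math
import random

def poly_kernel(x, z, degree):
    """Assumed kernel form (1 + <x, z>)**degree."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** degree

def gram_matrix(points, degree):
    return [[poly_kernel(x, z, degree) for z in points] for x in points]

def trace_bound(K, kappa):
    """Right-hand side of Lemma 3: kappa * sqrt(Tr[K]) / m."""
    m = len(K)
    return kappa * math.sqrt(sum(K[i][i] for i in range(m))) / m

def sample_restricted_estimate(K, n_draws, rng):
    """Monte Carlo estimate of (1/m) E_sigma max_j |sum_i sigma_i K(x_i, x_j)|."""
    m = len(K)
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(m)]
        total += max(abs(sum(sigma[i] * K[i][j] for i in range(m)))
                     for j in range(m))
    return total / (n_draws * m)

def unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

rng = random.Random(0)
points = [unit_vector(4, rng) for _ in range(30)]   # synthetic sample on the sphere
degree = 3
K = gram_matrix(points, degree)
kappa = math.sqrt(2.0 ** degree)   # sup of sqrt(K(x, x)) over unit-norm x
estimate = sample_restricted_estimate(K, 500, rng)
bound = trace_bound(K, kappa)
print(f"restricted estimate {estimate:.3f} <= trace bound {bound:.3f}")
```

Because every point has unit norm, each diagonal entry of the Gram matrix equals $2^k$, so the bound simplifies to $2^k/\sqrt{m}$ in this setting.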


Lemma 4. Let Kk be a polynomial kernel of degree k. Then, the empirical Rademacher complexity 
of Hk can be upper bounded as follows: 


^s{Hk) < 12 kI 



Proof. By the proof of Lemma 3, we can write


^s{Hk) < 


^E 

m <T 




E 2=1 




2k1^s{HI), 


where Hi is the family of linear funetions Hi = {w i—)■ w • <hfc(x): || w|| j£j, < By Dudley’s 
formula [Dudley, 1989], we ean write 


ms{Hl) < 12 



\ogAf{e,Hl,L2m 


m 


de, 


where V is the empirieal distribution. Sinee Hi ean be viewed as a subset of a -dimensional 
linear spaee and sinee | w ■ $^( 0 ;) | < ^ for all x G A” and w e Hi, we have logAA(e, Hi, L 2 {T>)) < 
log ■ Thus, we ean write 


ms{Hl) < 12 





which completes the proof.


□ 


Appendix B. Optimization Problem 

This section provides the derivation of the VKR optimization problem. We will assume that
$H_1, \ldots, H_p$ are $p$ families of functions with increasing Rademacher complexities $\mathfrak{R}_m(H_k)$, $k \in
[1, p]$, and, for any hypothesis $h \in \bigcup_{k=1}^{p} H_k$, denote by $d(h)$ the index of the hypothesis set it
belongs to, that is $h \in H_{d(h)}$. The bound of Theorem 1 holds uniformly for all $\rho > 0$ and functions
$f \in \operatorname{conv}(\bigcup_{k=1}^{p} H_k)$ at the price of an additional term. The condition
$\sum_{t=1}^{T} \alpha_t = 1$ of Theorem 1 can be relaxed to $\sum_{t=1}^{T} \alpha_t \le 1$. To see this, use for example a null
hypothesis ($h_t = 0$ for some $t$). Since the last term of the bound does not depend on $\alpha$, it suggests
selecting $\alpha$ to minimize


\[
\frac{1}{m}\sum_{i=1}^{m} 1_{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i) \le \rho} + \frac{4}{\rho}\sum_{t=1}^{T} \alpha_t r_t,
\]


where $r_t = \mathfrak{R}_m(H_{d(h_t)})$. Since for any $\rho > 0$, $f$ and $f/\rho$ admit the same generalization error, we
can instead search for $\alpha \ge 0$ with $\sum_{t=1}^{T} \alpha_t \le 1/\rho$, which leads to


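\[
\min_{\alpha \ge 0}\ \frac{1}{m}\sum_{i=1}^{m} 1_{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i) \le 1} + 4\sum_{t=1}^{T} \alpha_t r_t
\quad \text{s.t.} \quad \sum_{t=1}^{T} \alpha_t \le \frac{1}{\rho}.
\]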


The first term of the objective is not a convex function of $\alpha$ and its minimization is known to be
computationally hard. Thus, we will consider instead a convex upper bound based on the Hinge
loss: let $\Phi(-u) = \max(0, 1 - u)$; then $1_{u \le 0} \le \Phi(-u)$. Using this upper bound yields the following
convex optimization problem:


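\[
\min_{\alpha \ge 0}\ \frac{1}{m}\sum_{i=1}^{m} \Phi\Big(1 - y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\Big) + \lambda\sum_{t=1}^{T} \alpha_t r_t
\quad \text{s.t.} \quad \sum_{t=1}^{T} \alpha_t \le \frac{1}{\rho}, \qquad (5)
\]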


where we introduced a parameter $\lambda \ge 0$ controlling the balance between the magnitude of the
values taken by function $\Phi$ and the second term. Introducing a Lagrange variable $\beta \ge 0$ associated
with the constraint in (5), the problem can be equivalently written as


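\[
\min_{\alpha \ge 0}\ \frac{1}{m}\sum_{i=1}^{m} \Phi\Big(1 - y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\Big) + \sum_{t=1}^{T} (\lambda r_t + \beta)\,\alpha_t.
\]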
t=i t=i 


Here, $\beta$ is a parameter that can be freely selected by the algorithm since any choice of its value is
equivalent to a choice of $\rho$ in (5). Let $(h_{k,j})_{k,j}$ be the set of distinct base functions $x \mapsto K_k(x, x_j)$,
indexed from $1$ to $N$, and let $F$ be the objective function based on that collection:

\[
F(\alpha) = \frac{1}{m}\sum_{i=1}^{m} \Phi\Big(1 - y_i \sum_{j=1}^{N} \alpha_j h_j(x_i)\Big) + \sum_{j=1}^{N} \Lambda_j \alpha_j,
\]

with $\alpha = (\alpha_1, \ldots, \alpha_N) \in \mathbb{R}^N$ and $\Lambda_j = \lambda r_j + \beta$, for all $j \in [1, N]$. This coincides precisely with
the optimization problem $\min_{\alpha \ge 0} F(\alpha)$ defining Voted Kernel Regularization. Since the problem
was derived by minimizing a Hinge loss upper bound on the generalization bound, this shows
that the solution returned by Voted Kernel Regularization benefits from the strong data-dependent
learning guarantees of Theorem 1.
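
The structure of the regularized objective can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it uses the hinge form $\max(0, 1 - y_i f(x_i))$ that appears in the coordinate descent appendix, builds the penalties $\Lambda_j = \lambda r_j + \beta$ from hypothetical complexity values $r_j$, and checks midpoint convexity on toy data. All function names, base-function outputs and numeric values are made up for the example.

```python
def hinge(u):
    """Hinge loss max(0, 1 - u) evaluated at the margin u."""
    return max(0.0, 1.0 - u)

def penalties_from_complexities(r, lam, beta):
    """Lambda_j = lam * r_j + beta for each base function j."""
    return [lam * r_j + beta for r_j in r]

def vkr_objective(alpha, base_outputs, labels, penalties):
    """(1/m) sum_i hinge(y_i * sum_j alpha_j h_j(x_i)) + sum_j Lambda_j * alpha_j.

    base_outputs[i][j] holds h_j(x_i); alpha is assumed non-negative.
    """
    m = len(labels)
    loss = 0.0
    for i in range(m):
        margin = labels[i] * sum(a * h for a, h in zip(alpha, base_outputs[i]))
        loss += hinge(margin)
    return loss / m + sum(L * a for L, a in zip(penalties, alpha))

# Toy data: 4 examples, 3 base functions (all values hypothetical).
labels = [1, -1, 1, -1]
base_outputs = [[0.9, 0.2, -0.1],
                [-0.8, 0.1, 0.3],
                [0.4, -0.5, 0.7],
                [-0.2, -0.6, -0.9]]
penalties = penalties_from_complexities([0.05, 0.1, 0.2], lam=1.0, beta=0.01)

a = [0.0, 0.0, 0.0]
b = [2.0, 0.5, 1.0]
mid = [(u + v) / 2 for u, v in zip(a, b)]
F_a = vkr_objective(a, base_outputs, labels, penalties)
F_b = vkr_objective(b, base_outputs, labels, penalties)
F_mid = vkr_objective(mid, base_outputs, labels, penalties)
# Convexity: the value at the midpoint is at most the average of the endpoints.
assert F_mid <= (F_a + F_b) / 2 + 1e-12
print(F_a, F_mid, F_b)
```

Base functions from more complex families receive a larger $\Lambda_j$, so a given weight on them costs more; this is how the complexity-dependent penalty enters the objective.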


Appendix C. Coordinate Descent (CD) Formulation 

An alternative approach for solving the Voted Kernel Regularization optimization problem (1)
consists of using a coordinate descent method. A coordinate descent method proceeds in rounds.
At each round, it maintains a parameter vector $\alpha$. Let $\alpha_t = (\alpha_{t,k,j})_{k,j}$ denote the vector obtained
after $t \ge 1$ iterations and let $\alpha_0 = 0$. Let $e_{k,j}$ denote the unit vector in direction $(k, j)$ in
$\mathbb{R}^{p \times m}$. Then, the direction $e_{k,j}$ and the step $\eta$ selected at the $t$-th round are those minimizing
$F(\alpha_{t-1} + \eta e_{k,j})$, that is

\[
\frac{1}{m}\sum_{i=1}^{m} \max\Big(0,\ 1 - y_i f_{t-1}(x_i) - y_i y_j\, \eta\, K_k(x_i, x_j)\Big)
+ \sum_{(k',j') \neq (k,j)} \Lambda_{k'}\,|\alpha_{t-1,k',j'}| + \Lambda_k\,|\eta + \alpha_{t-1,k,j}|,
\]

where $f_{t-1} = \sum_{j=1}^{m}\sum_{k=1}^{p} \alpha_{t-1,k,j}\, y_j K_k(\cdot, x_j)$.

Table 2. Dataset statistics.

Data set        Examples    Features
breastcancer         699           9
climate              540          18
diabetes             768           8
german              1000          24
ionosphere           351          34
musk                 476         166
ocr49               2000         196
phishing            2456          30
retinopathy         1151          19
vertebral            310           6
waveform01          3304          21

To find the best descent direction, a coordinate
descent method computes the sub-gradient in the direction $(k, j)$ for each $(k, j) \in [1, p] \times [1, m]$.
The sub-gradient is given by


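\[
\delta F(\alpha_{t-1}, e_{k,j}) =
\begin{cases}
\frac{1}{m}\sum_{i=1}^{m} \phi_{t,j,k,i} + \operatorname{sgn}(\alpha_{t-1,k,j})\,\Lambda_k & \text{if } \alpha_{t-1,k,j} \neq 0, \\
0 & \text{else if } \Big|\frac{1}{m}\sum_{i=1}^{m} \phi_{t,j,k,i}\Big| \le \Lambda_k, \\
\frac{1}{m}\sum_{i=1}^{m} \phi_{t,j,k,i} - \operatorname{sgn}\Big(\frac{1}{m}\sum_{i=1}^{m} \phi_{t,j,k,i}\Big)\Lambda_k & \text{otherwise,}
\end{cases}
\]

where $\phi_{t,j,k,i} = -y_i y_j K_k(x_i, x_j)$ if $\sum_{k=1}^{p}\sum_{j=1}^{m} \alpha_{t-1,k,j}\, y_i y_j K_k(x_i, x_j) < 1$ and $0$ otherwise. Once
the optimal direction $e_{k,j}$ is determined, the step size $\eta_t$ can be found using a line search or other
numerical methods.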

The advantage of the coordinate descent formulation over the LP formulation is that there is
no need to explicitly store the whole vector $\alpha$, but rather only its non-zero entries. This enables
learning with a very large number of base hypotheses, including scenarios in which the number of
base hypotheses is infinite.
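
The rounds described in this appendix can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses signed coefficients with an $|\alpha|$ penalty and base functions $y_j K_k(\cdot, x_j)$, picks the coordinate with the largest directional sub-gradient magnitude, and replaces the unspecified line search with a ternary search on a fixed interval. The toy data, the polynomial kernels of degree 1 and 2, the penalties and all function names are hypothetical.

```python
import math
import random

def margins(alpha, grams, y):
    """y_i * f(x_i) with f = sum_{k,j} alpha[k][j] * y_j * K_k(., x_j)."""
    m = len(y)
    return [y[i] * sum(alpha[k][j] * y[j] * grams[k][i][j]
                       for k in range(len(grams)) for j in range(m))
            for i in range(m)]

def objective(alpha, grams, y, lambdas):
    m = len(y)
    loss = sum(max(0.0, 1.0 - v) for v in margins(alpha, grams, y)) / m
    penalty = sum(lambdas[k] * abs(a) for k, row in enumerate(alpha) for a in row)
    return loss + penalty

def directional_subgradient(alpha, grams, y, lambdas, k, j, marg):
    """Three-case sub-gradient in direction (k, j)."""
    m = len(y)
    g = sum(-y[i] * y[j] * grams[k][i][j] for i in range(m) if marg[i] < 1.0) / m
    a = alpha[k][j]
    if a != 0.0:
        return g + math.copysign(lambdas[k], a)
    if abs(g) <= lambdas[k]:
        return 0.0
    return g - math.copysign(lambdas[k], g)

def line_search(phi, lo=-4.0, hi=4.0, iters=60):
    """Ternary search for the minimizer of a convex 1-D function on [lo, hi]."""
    for _ in range(iters):
        a, b = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if phi(a) <= phi(b):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2

def coordinate_descent(grams, y, lambdas, rounds):
    m = len(y)
    alpha = [[0.0] * m for _ in grams]
    history = [objective(alpha, grams, y, lambdas)]
    coords = [(k, j) for k in range(len(grams)) for j in range(m)]
    for _ in range(rounds):
        marg = margins(alpha, grams, y)
        k, j = max(coords, key=lambda c: abs(directional_subgradient(
            alpha, grams, y, lambdas, c[0], c[1], marg)))
        base = alpha[k][j]

        def phi(eta):
            # Objective along direction (k, j); alpha is restored afterwards.
            alpha[k][j] = base + eta
            value = objective(alpha, grams, y, lambdas)
            alpha[k][j] = base
            return value

        eta = line_search(phi)
        if phi(eta) < history[-1]:   # accept only steps that decrease F
            alpha[k][j] = base + eta
        history.append(objective(alpha, grams, y, lambdas))
    return alpha, history

rng = random.Random(1)
pts = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(12)]
y = [1 if a * b > 0 else -1 for a, b in pts]
grams = [[[(1.0 + u[0] * v[0] + u[1] * v[1]) ** d for v in pts] for u in pts]
         for d in (1, 2)]
lambdas = [0.01, 0.02]
alpha, history = coordinate_descent(grams, y, lambdas, rounds=15)
print(history[0], history[-1])
```

Only one coefficient changes per round, so after $t$ rounds at most $t$ entries of $\alpha$ are non-zero; a sparse dictionary keyed by $(k, j)$ could replace the dense lists used here for brevity.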


Appendix D. Dataset Statistics 


The dataset statistics are provided in Table 2.








